library(tidyverse)
library(DALEX)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)
library(xgboost)
set.seed(20260216)
theme_set(theme_classic())
Boosting and Pocket Measurements
Motivation
This example is based on the article “Pockets” by Jan Diehm and Amber Thomas for The Pudding. Their question: are pocket sizes systematically smaller in women’s pants, to the point that they couldn’t hold common objects like wallets or smartphones? They gathered data on pocket characteristics from major brands.
pockets <- read_csv("https://github.com/the-pudding/data/raw/refs/heads/master/pockets/measurements.csv")
pockets
# A tibble: 80 × 16
brand style menWomen name fabric price maxHeightFront minHeightFront
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Arizona skinny women Fave… 78% c… 42 14.5 15
2 Arizona strai… women Perf… 78% c… 42 14.5 14
3 Ralph Lauren skinny women Mode… 92% c… 89.5 13 13.5
4 Ralph Lauren strai… women Prem… 92% c… 89.5 13 13.5
5 Uniqlo skinny women Skin… 87% c… 39.9 13 13
6 Uniqlo strai… women High… 98% c… 39.9 15.5 12
7 Calvin Klein skinny women Midr… 98% c… 79.5 12 12
8 Calvin Klein strai… women Stra… 85% c… 69.5 14 11.2
9 Lucky skinny women Ava … 69% c… 99 13 14.5
10 Lucky strai… women Swee… 96% c… 79.5 15 16.5
# ℹ 70 more rows
# ℹ 8 more variables: rivetHeightFront <dbl>, maxWidthFront <dbl>,
# minWidthFront <dbl>, maxHeightBack <dbl>, minHeightBack <dbl>,
# maxWidthBack <dbl>, minWidthBack <dbl>, cutout <lgl>
Our approach is to predict whether pockets are men’s or women’s using all the available features. High accuracy would mean that pockets differ systematically in style, price, or pocket dimensions. We can then inspect variable importance to see which features drive the distinction.
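Before modeling, it is worth confirming how the two classes are distributed, since error rates are only meaningful relative to the class frequencies. A quick sketch:

```r
# Sanity check: how many men's vs. women's pants are in the data?
count(pockets, menWomen)
```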
Data Processing
We will convert character-valued columns into factors for easier encoding later on. We omit fabric and name (ID-like variables with many levels) and maxHeightFront (so strongly associated with the response that it would dominate the interpretation and make this example less representative).
pockets <- pockets |>
select(!any_of(c("fabric", "name", "maxHeightFront"))) |>
  mutate(across(where(is.character), as.factor))
Here’s the processed version of the data.
pockets
# A tibble: 80 × 13
brand style menWomen price minHeightFront rivetHeightFront maxWidthFront
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Arizona skin… women 42 15 6.5 16.5
2 Arizona stra… women 42 14 6.5 16
3 Ralph Lau… skin… women 89.5 13.5 6.5 14.5
4 Ralph Lau… stra… women 89.5 13.5 6.5 14.5
5 Uniqlo skin… women 39.9 13 5.5 14
6 Uniqlo stra… women 39.9 12 6.5 16.5
7 Calvin Kl… skin… women 79.5 12 6.5 14.5
8 Calvin Kl… stra… women 69.5 11.2 6.5 15.5
9 Lucky skin… women 99 14.5 6 13.5
10 Lucky stra… women 79.5 16.5 5.5 15.5
# ℹ 70 more rows
# ℹ 6 more variables: minWidthFront <dbl>, maxHeightBack <dbl>,
# minHeightBack <dbl>, maxWidthBack <dbl>, minWidthBack <dbl>, cutout <lgl>
Hyperparameter Tuning
Next, we define our hyperparameter tuning pipeline. The setup resembles the 03-gotv_cart.qmd analysis, but with two changes. First, we use classif.xgboost to specify that we want a boosting model. Second, we add a po("encode") preprocessing step to turn categorical features into numerical encodings. Although boosting can handle categorical types in theory, the xgboost implementation in mlr3 requires numerical predictors. We could do this manually, but the mlr3 pipeline manages it for us.
task <- as_task_classif(
pockets,
target = "menWomen"
)
learner <- as_learner(
po("encode") %>>%
lrn(
"classif.xgboost",
nrounds = to_tune(100, 300),
max_depth = to_tune(1, 3),
eta = to_tune(1e-3, 1e-1),
predict_type = "prob"
)
)
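As an aside, we can inspect what the encoding step produces by training it on the task directly. A minimal sketch, assuming the mlr3pipelines default (one-hot) encoding:

```r
# Sketch: run the encoding PipeOp on its own and look at the resulting
# feature names (assumes the default method of po("encode"))
encoded_task <- po("encode")$train(list(task))[[1]]
encoded_task$feature_names
```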
instance <- ti(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 2),
measures = msr("classif.ce"),
terminator = trm("none")
)
Now that we’ve set up the pipeline, let’s search over hyperparameters to find a reasonable choice.
tuner <- tnr("grid_search", resolution = 3, batch_size = 3)
tuner$optimize(instance)
   classif.xgboost.eta classif.xgboost.max_depth classif.xgboost.nrounds
<num> <int> <int>
1: 0.1 2 100
learner_param_vals x_domain classif.ce
<list> <list> <num>
1: <list[7]> <list[3]> 0.15
The classification error shows that faster-learning trees (eta = 0.1) with some interactions (max_depth = 2) did the best. Classification error is simply the fraction of misclassified observations: values near 0 mean perfect classification, and with balanced classes a value near 0.5 means random guessing. Our model’s error of 0.15 is much better than that baseline.
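As a point of reference, the error of always guessing the majority class can be computed directly; with balanced classes this baseline is 0.5. A quick sketch:

```r
# Baseline error: always predict the most common class
1 - max(prop.table(table(pockets$menWomen)))
```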
as.data.table(instance$archive) |>
rename_with(~ str_remove(., "classif.xgboost.")) |>
arrange(classif.ce) |>
  select(1:4)
       eta max_depth nrounds classif.ce
<num> <int> <int> <num>
1: 0.1000 2 100 0.1500
2: 0.0505 1 200 0.1625
3: 0.1000 3 200 0.1625
4: 0.1000 2 200 0.1625
5: 0.0505 3 300 0.1625
6: 0.0505 2 200 0.1625
7: 0.0505 1 300 0.1625
8: 0.0505 3 200 0.1625
9: 0.1000 3 100 0.1625
10: 0.0505 2 300 0.1750
11: 0.1000 1 100 0.1750
12: 0.1000 1 200 0.1750
13: 0.1000 2 300 0.1750
14: 0.0505 2 100 0.1750
15: 0.1000 3 300 0.1750
16: 0.0505 1 100 0.1750
17: 0.0010 1 300 0.1875
18: 0.1000 1 300 0.1875
19: 0.0505 3 100 0.1875
20: 0.0010 3 200 0.2750
21: 0.0010 2 200 0.2750
22: 0.0010 2 300 0.2750
23: 0.0010 3 300 0.2750
24: 0.0010 1 100 0.5750
25: 0.0010 3 100 0.5750
26: 0.0010 1 200 0.5750
27: 0.0010 2 100 0.5750
eta max_depth nrounds classif.ce
<num> <int> <int> <num>
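Rather than scanning the table, it can help to visualize the archive. A sketch using the archive columns printed above:

```r
# Sketch: classification error across the tuning grid,
# faceted by tree depth and colored by learning rate
as.data.table(instance$archive) |>
  ggplot(aes(factor(classif.xgboost.nrounds), classif.ce,
             color = factor(classif.xgboost.eta))) +
  geom_point() +
  facet_wrap(vars(classif.xgboost.max_depth)) +
  labs(x = "nrounds", color = "eta")
```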
Refit Model
Let’s refit the model on the entire dataset (so far, we’ve been training on CV split versions). This is the model that we would use in practice, and it’s what we’ll interpret next.
learner$param_set$values <- instance$result_learner_param_vals
learner$train(task) # retrain on all data
fit <- learner$model
Interpretation
A simple check: compare predicted probabilities against the actual class. The probabilities barely overlap, meaning that the measured features predict men’s vs. women’s pocket type quite reliably.
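That check can be sketched as follows; the probability column name "men" is assumed from the factor levels of menWomen:

```r
# Sketch: in-sample predicted probability of "men", split by the true class
preds <- learner$predict(task)
ggplot(
  data.frame(truth = preds$truth, prob_men = preds$prob[, "men"]),
  aes(truth, prob_men)
) +
  geom_jitter(width = 0.1, height = 0)
```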
The xgboost package provides its own variable importance measure. It looks at which variables were split on most often during training, as well as how much those splits increased node purity. We can see that the top four variables are all related to the dimensions of the pocket.
fit <- learner$model$classif.xgboost$model
fit <- attr(fit[[1]], "model") # messy, but seems to be only way to access original model....
xgb.importance(model = fit)
             Feature        Gain       Cover   Frequency
<char> <num> <num> <num>
1: minHeightFront 0.621449765 0.360998572 0.270042194
2: minWidthFront 0.123103109 0.175654586 0.185654008
3: maxHeightBack 0.079520156 0.136094368 0.147679325
4: rivetHeightFront 0.060789738 0.077658123 0.075949367
5: cutout 0.042073073 0.056660149 0.054852321
6: price 0.031754325 0.085450628 0.126582278
7: minHeightBack 0.025086866 0.048867292 0.050632911
8: maxWidthBack 0.006768666 0.032342490 0.046413502
9: minWidthBack 0.003424694 0.017630925 0.033755274
10: maxWidthFront 0.003153266 0.003200917 0.004219409
11: style.slim 0.002876342 0.005441949 0.004219409
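xgboost also ships a plotting helper for the same importance table; a minimal sketch:

```r
# Sketch: bar chart of the gain-based importance scores
imp <- xgb.importance(model = fit)
xgb.plot.importance(imp, top_n = 5)
```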
Finally, we look at partial dependence plots. Each line represents one pocket from the data. By passing a custom predict_function argument, we can study changes in predicted probabilities rather than just class labels. Each curve shows how a pocket’s predicted probability shifts when we intervene on the plotted variable. The effect of minHeightFront is especially clear: once the minimum pocket height exceeds about 15, the model predicts a men’s pocket.
explainer <- DALEX::explain(
learner,
data = pockets,
y = as.numeric(pockets$menWomen),
predict_function = \(model, newdata) {
model$predict_newdata(newdata)$prob[, 1]
}
)
Preparation of a new explainer is initiated
-> model label : R6 ( default )
-> data : 80 rows 13 cols
-> data : tibble converted into a data.frame
-> target variable : 80 values
-> predict function : function(model, newdata) { model$predict_newdata(newdata)$prob[, 1] }
-> predicted values : No value for predict function target column. ( default )
-> model_info : package Model of class: GraphLearner package unrecognized , ver. Unknown , task regression ( default )
-> predicted values : numerical, min = 0.006793982 , mean = 0.5003302 , max = 0.9943342
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = 0.005665779 , mean = 0.9996698 , max = 1.993206
A new explainer has been created!
pdp <- model_profile(explainer, variables = c("minHeightFront", "minWidthFront", "rivetHeightFront"))
plot(pdp, geom = "profiles")
Indeed, if we go back to the raw data, we can see a clear difference in these variables between men’s and women’s pockets.
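That raw-data comparison can be sketched with a grouped summary of the variables shown in the partial dependence plots:

```r
# Sketch: median pocket dimensions by pocket type
pockets |>
  group_by(menWomen) |>
  summarise(across(
    c(minHeightFront, minWidthFront, rivetHeightFront),
    median
  ))
```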